Importance of Explicit Vectorization for CPU and GPU Software Performance

نویسندگان

  • Neil G. Dickson
  • Kamran Karimi
  • Firas Hamze
چکیده

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to speedup from multithreading. This is 2x faster than the fully-optimized GPU version.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Accelerating High-Dimensional Nearest Neighbors for Video Search

The k-nearest neighbor algorithm (kNN) is a critical algorithm used extensively in fields such as Computer Vision, Robotics, and Machine Learning. In this work, we address the performance of FLANN, a popular kNN library, at the node-level by co-designing indexing and search algorithms with software support. We characterize, profile, and optimize FLANN for high-dimensionality (e.g., ≥ 4096) for ...

متن کامل

Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance

Developing high performance GPU code is labor intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPUs do not fully exploit optimizations in well-tuned GPU codes. To address this problem, we develop an OpenCL implementation that efficie...

متن کامل

A Study of Productivity and Performance of Modern Vector Processors

This bachelor thesis carries out a case study describing the performance and productivity of modern vector processors such as graphics processing units (GPUs) and central processing units (CPUs) based on three different computational routines arising from a magnetoencephalography application. I apply different programming paradigms to these routines targeting either the CPU or the GPU. Furtherm...

متن کامل

Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model

Current multiand many-core computing typically involves multi-core Central Processing Units (CPU) and many-core Graphical Processing Units (GPU) whose architectures are distinctly different. To keep longevity of application codes, it is highly desirable to have a programming paradigm to support these current and future architectures. Open Computing Language (OpenCL) is created to address this p...

متن کامل

Explicit Vector Programming with OpenMP 4.0 SIMD Extensions

Modern CPU and GPU processors with on-die integration of SIMD execution units for achieving higher performance and power efficiency have posed challenges to use the underlying SIMD hardware (or VPUs, Vector Processing Unit) effectively. Wide vector registers and SIMD instructions –Single Instructions operating on Multiple Data elements packed in wide registers such as AltiVec [2], SSE, AVX[10] ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • J. Comput. Physics

دوره 230  شماره 

صفحات  -

تاریخ انتشار 2011